A Computational Grammar of Sinhala for English-Sinhala Machine Translation

Authors

  • B. Hettige
  • Budditha Hettige
Abstract

Communication is fundamental to the evolution and development of all living beings. Languages are arguably the most remarkable artifacts developed by humankind to enable communication. The computer has also become a unique machine because of its capacity to communicate with humans through languages. The languages understood by computers and humans are quite different, yet people can communicate with computers. This is possible because the computer is fundamentally an artifact that translates one language into another. Computers should therefore be better suited to language translation than to any other computing task. Computing is now evolving to enable machine-to-machine communication with little or no human intervention, yet humans continue to face the so-called language barrier. In particular, a vast body of world knowledge written in English remains inaccessible to communities who cannot communicate in English, and such communities are unable to contribute to the development of world knowledge. As a result, many researchers have taken up computer-aided natural language translation, an area commonly known as Machine Translation. Apertium, Babel Fish, Google Translate, SYSTRAN, EDR, Anusaaraka, AnglaHindi, AnglaBharti, and Mantra are examples of popular machine translation systems. These systems use various approaches, including Human-assisted, Rule-based, Corpus-based, Knowledge-based, Hybrid and Agent-based, to translate from one language to another. However, because of the inherent diversity of natural languages, a generic machine translation approach is far from reality. This thesis presents a computational grammar for the Sinhala language, used to develop an English to Sinhala machine translation system with an underlying theoretical basis.
This system is known as BEES, an acronym for Bilingual Expert for English to Sinhala machine translation. The concept of Varanegeema (conjugation) in the Sinhala language is the philosophical basis of the approach taken in developing BEES. Varanegeema handles a large number of language primitives associated with nouns and verbs, such as person, gender, tense, number, preposition and subjectivity/objectivity. More importantly, Varanegeema allows all associated word forms to be derived from a given base word, which drastically reduces the size of the Sinhala dictionary. Since the concept of Varanegeema can be expressed as a set of rules, it fits well with a rule-based implementation of machine translation. BEES implements 85 grammar rules for Sinhala nouns and 18 rules for Sinhala verbs. BEES comprises seven modules: English Morphological Analyzer, English Parser, English to Sinhala Base Word Translator, Sinhala Morphological Generator, Sinhala Parser, Transliteration Module and Intermediate Editor. In addition to the main modules, the system includes four dictionaries: the English dictionary, the Sinhala dictionary, the English-Sinhala bilingual dictionary and the Concept dictionary. BEES primarily shares features with the Rule-based, Context-based and Human-assisted approaches to machine translation. BEES has been implemented in Java and SWI-Prolog and runs on both Linux and Windows. BEES has been evaluated to test the hypothesis that the concept of Varanegeema can be used to drive English to Sinhala machine translation. The evaluation was carried out in three steps.
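The dictionary-size argument above (store one base word, derive the other forms by rule) can be illustrated with a minimal sketch. It is written in Python for brevity, not in the Java and SWI-Prolog used by BEES, and everything in it is an assumption made for illustration: the `Rule` type, the `class_pota` rule class, the transliterated suffixes and the one-entry lexicon are hypothetical and are not taken from the thesis's 85 noun rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One hypothetical conjugation rule: a feature bundle and the suffix it adds."""
    features: tuple
    suffix: str


# Hypothetical rule class for one group of nouns (illustrative, not from the thesis).
NOUN_RULES = {
    "class_pota": [
        Rule(("singular", "definite"), "a"),
        Rule(("singular", "indefinite"), "ak"),
        Rule(("plural",), ""),
    ],
}

# The dictionary stores only a stem and a rule-class label per entry
# (hypothetical layout; transliterated for readability).
LEXICON = {"book": ("pot", "class_pota")}


def generate_forms(english_word):
    """Derive every word form of a dictionary entry by applying its rule class."""
    stem, rule_class = LEXICON[english_word]
    return {rule.features: stem + rule.suffix for rule in NOUN_RULES[rule_class]}


if __name__ == "__main__":
    for features, form in generate_forms("book").items():
        print(features, form)
```

Under this layout, adding a new noun costs one lexicon entry, while the rule class is shared by every noun that inflects the same way.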
As the first step, all the language processing primitives, such as the morphological analyzers, parsers, translator and transliteration module, were tested using a white-box testing approach. To test each module, several online testing tools were implemented, including an English morphological analyzer, an English parser and a Sinhala word generator. Using these online tools, each module was fully tested against a carefully prepared test plan. In addition, an online evaluation test bed was implemented to continuously capture feedback from online users. This test bed lets users compose different types of sentences from a given set of words. The Word Error Rate and the Sentence Error Rate were calculated from these evaluation results. Finally, intelligibility and accuracy tests were conducted with human evaluators. For these tests, two hundred sample sentences were collected and grouped into 20 sets of 10 sentences each. Each sentence was translated using the English to Sinhala machine translation system, and each set was then given to human translators and scored. Intelligibility and accuracy were calculated from these scores. The experimental results show that the English morphological analyzer, English parser, English to Sinhala base word translator, Sinhala morphological generator and Sinhala sentence generator each work with more than 90% accuracy. The overall evaluation shows 89% accuracy, with a word error rate of 7.2% and a sentence error rate of 5.4%. BEES successfully translates English sentences with simple or complex subjects and objects, and handles the most commonly used tense patterns, including active and passive voice forms.
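The abstract does not say how the Word Error Rate and Sentence Error Rate were defined. As a rough illustration only, the sketch below uses the conventional definitions: word-level edit distance divided by the number of reference words, and the fraction of sentences that differ from their reference at all. The function names and sample sentences are assumptions, and the thesis may have computed these figures differently.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two word lists (substitute, insert, delete)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # delete a reference word
                               current[j - 1] + 1,       # insert a hypothesis word
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def error_rates(pairs):
    """Return (WER, SER) for a list of (reference, system_output) sentence pairs."""
    word_errors = 0
    reference_words = 0
    wrong_sentences = 0
    for reference, output in pairs:
        ref_tokens, out_tokens = reference.split(), output.split()
        distance = word_edit_distance(ref_tokens, out_tokens)
        word_errors += distance
        reference_words += len(ref_tokens)
        if distance > 0:
            wrong_sentences += 1
    return word_errors / reference_words, wrong_sentences / len(pairs)


if __name__ == "__main__":
    # Placeholder tokens stand in for reference and system translations.
    sample = [("a b c d", "a b c d"), ("a b c d", "a x c")]
    wer, ser = error_rates(sample)
    print(f"WER={wer:.2%} SER={ser:.2%}")
```

In the sample, the second output has one substituted and one missing word out of eight reference words in total, so the sketch reports a WER of 25% and an SER of 50%.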

Similar resources

ESANA: Hybrid Machine Translation Approach for English-to-Sinhala Language Translation

In modern society, the Internet has become the most popular and efficient communication medium. Most Internet resources are available in English. However, the English fluency rate in the majority of countries is not at a satisfactory level. In Sri Lanka, it was observed that the English fluency rate has declined over the past 30 years. Therefore the need for an English-to-Sinhala t...

A Statistical Machine Translation Approach to Sinhala-Tamil Language Translation

Data-driven approaches to Machine Translation have come to the fore of Language Processing research over the past decade. The relative success, in terms of robustness, of Example-Based and Statistical approaches has given rise to a new optimism and an exploration of other data-driven approaches such as Maximum Entropy language modeling. Much of the work in the literature, however, largely reports ...

Dialogue Act Recognition for Text-based Sinhala

This paper discusses the application of classical machine learning approaches to the task of Dialogue Act Recognition for text-based Sinhala. A study was carried out to identify a dialogue act tag set for Sinhala. A new corpus using Sinhala subtitles for English movies was created and was annotated with the selected dialogue acts. Evaluation of the dialogue act recognition system was performed ...

Multi-agent System Technology for Morphological Analysis

Machine Translation involves multiple phases, including morphological, syntactic and semantic analysis of the source and target languages. Although there are numerous approaches to machine translation, the handling of semantics remains an unsolved research challenge. We have been researching how to exploit the power of multi-agent systems technology for machine translation by extending our rule-based machine trans...

Sinhala-Tamil Machine Translation: Towards better Translation Quality

Statistical Machine Translation (SMT) is a well-known and well-established data-driven approach to language translation. The focus of this work is to develop a statistical machine translation system for the Sri Lankan language pair Sinhala and Tamil. This paper presents a systematic investigation of how Sinhala-Tamil SMT performance varies with the amount of parallel training data us...



Publication date: 2011